<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
  <title>Tight MPICH Integration in Grid Engine</title>
</head>
<body bgcolor="#ffffff">
<p><font size="+1"><b>Topic:</b></font></p>
<p>Set up MPICH so that all child processes are killed on the slave
nodes.</p>
<p><font size="+1"><b>Author:</b></font></p>
<p>Reuti, <a href="mailto:reuti__at__staff.uni-marburg.de">reuti__at__staff.uni-marburg.de</a>;
Philipps-University of Marburg, Germany</p>
<p><font size="+1"><b>Version:</b></font></p>
<p>1.3 -- 2005-02-14 Minor change for MPICH 1.2.6, comments and
corrections are welcome<br>
1.3.1 -- 2010-11-22 Final version with minor adjustments</p>
<p><font size="+1"><b>Contents:</b></font></p>
<ul>
  <li>Symptom of this behaviour</li>
  <li>Explanation</li>
  <li>Solutions</li>
  <li>Number of tasks spread to the nodes</li>
  <li>What rsh command is compiled into a binary?</li>
  <li>Option -nolocal to mpirun</li>
  <li>Uneven distribution of processes to the slave nodes with two
network cards</li>
  <li>Wrong interface selected for the back channel of the MPICH-tasks
with the ch_p4-device</li>
  <li>Hint for ADF</li>
  <li>Hint for Turbomole</li>
</ul>
<p><font size="+1"><b>Note:</b></font></p>
<p>This HOWTO complements the information contained in the
$SGE_ROOT/mpi directory of the Grid Engine distribution.</p>
<p>
</p>
<hr><br>
<div style="text-align: center;"><big><big
 style="color: rgb(255, 0, 0);"><big><big>!!!
Warning !!!</big></big></big></big><br>
<br>
</div>
MPICH(1) is no longer maintained; the final version was 1.2.7p1. Since
version 1.3, the follow-up project MPICH2 uses Hydra as the default
startup method for its slave tasks, and the other startup methods will
be removed over time. Hydra has a Tight Integration with SGE compiled
in by default, so no special setup in SGE is necessary any longer to
support MPICH2 jobs.<br>
<br>
Hydra works out-of-the-box with a parallel environment in which both
start_proc_args and stop_proc_args are set to NONE (in essence, the
same PE can now be used for Open MPI and MPICH2), and a plain <span
 style="font-style: italic;">mpiexec</span> in the job script will
automatically discover the granted slots and nodes without any further
options. Nevertheless, if more than one MPI installation is available
in a cluster, the correct <span style="font-style: italic;">mpiexec</span>
corresponding to the compiled application must be used as usual.<br>
<br>
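<p>Such a PE could look like the following sketch (the PE name, the
slot count and the allocation rule are just examples; display your own
settings with <span style="font-style: italic;">qconf -sp
&lt;pe_name&gt;</span>):</p>
<pre>pe_name            mpich2
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min</pre>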
This document just stays here in case you come across a legacy
installation and have to fix or set up such an older version.<br>
<p>
</p>
<hr><br>
<font size="+1"><b>Symptom of this behaviour</b></font>
<p></p>
<blockquote>You have parallel jobs using MPICH under SGE on LINUX. Some
of these jobs are killed nicely when you use qdel and don't leave any
running processes behind on the nodes. Other MPICH jobs are killed, but
the computing tasks are still present after the job is killed and keep
consuming CPU time, although these jobs no longer appear in SGE.
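  <p>You can check for such leftover tasks after a qdel (job id, node
name and program name are only illustrative):</p>
  <pre>&gt; qdel 4711
&gt; rsh node03 ps -eo pid,pgrp,command | grep "[a]\.out"</pre>
</blockquote>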
<p>
</p>
<hr><br>
<font size="+1"><b>Explanation</b></font>
<p></p>
<blockquote>Every MPICH task on a slave node, created by an rsh
command, tries to become a process group leader. If the MPICH task is
the direct child of qrsh_starter, this is okay, as it already is the
process group leader, and you can kill these jobs in the usual way:
the child of qrsh_starter is killed with "kill (-pid)", i.e. with a
negative PID, and hence the whole process group is killed. You can
check this with the command:
  <pre>&gt; ps f -eo pid,uid,gid,user,pgrp,command --cols=120</pre>
  <p>If the startup script for the process on the slave nodes consists
of at least two commands (like the startup with Myrinet on the nodes),
a helping shell will be created, and the MPICH task will be a child of
this shell with a new PID, but still in the same process group. The odd
thing is that MPICH will discover this and enforce a new process group
with this PID, to become a process group leader. So the MPICH task
jumps out of the creating process group, and the intended kill of the
started process group will fail, leaving the computing tasks running.</p>
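  <p>As a small illustration of the mechanism (the PIDs are made up): a
whole process group can be killed with a negative PID, but a task which
moved to its own process group will survive such a kill:</p>
  <pre>&gt; sh -c 'sleep 600 &amp; sleep 600 &amp; wait' &amp;
[1] 12345
&gt; ps -eo pid,pgrp,command | grep "[s]leep"
12346 12345 sleep 600
12347 12345 sleep 600
&gt; kill -TERM -12345</pre>
  <p>Both sleep processes inherited the process group 12345 of the
backgrounded shell and are hence hit by the group kill; a task which
called setsid() or setpgid() in between would not be.</p>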
</blockquote>
<p>
</p>
<hr><br>
<font size="+1"><b>Solutions</b></font>
<p></p>
<blockquote>To solve this, various possibilities are available; you may
choose the one which fits best to your setup of SGE.<br>
  <p><b>1. Replace the helping shell with the MPICH task</b></p>
  <p>In case of e.g. Myrinet, the helping shell is created by one line
in mpirun.ch_gm.pl:</p>
  <pre>$cmdline = "cd $wdir ; env $varenv $apps_cmd[$i] $apps_flags[$i]";</pre>
  <p>You can prefix the final call to your program with an "exec", so
that the line looks like:</p>
  <pre>$cmdline = "cd $wdir ; exec env $varenv $apps_cmd[$i] $apps_flags[$i]";</pre>
  <p>With the "exec", the started program will replace the existing
shell, and so will stay to be the process leader.<br>
  </p>
  <p><b>2. Define MPICH_PROCESS_GROUP=no</b></p>
  <p>When this environment variable is defined, the startup of the
MPICH task won't create a new process group. Where it must be defined
depends on the used MPICH device:</p>
  <p>
  <table height="40" width="350" border="0">
    <tbody>
      <tr>
        <td width="50">
        <p align="right">ch_p4:</p>
        </td>
        <td width="3">
        <p></p>
        <br>
        </td>
        <td width="230">
        <p>must be set on the master node and the slaves</p>
        </td>
      </tr>
      <tr>
        <td width="50">
        <p align="right">Myrinet:</p>
        </td>
        <td width="3">
        <p></p>
        <br>
        </td>
        <td width="230">
        <p>only has to be defined on the slaves</p>
        </td>
      </tr>
    </tbody>
  </table>
  <br>
So you may define it in any file which will be sourced during a
noninteractive login to the slave nodes. For the master node, you have
to define it in the submit script. If you want to propagate it to the
slave nodes this way (and avoid changing any login file), you will have
to edit the rsh-wrapper in the mpi directory of SGE, so that the qrsh
command used there includes -V and the variables become available on
the slaves:</p>
  <pre>echo $SGE_ROOT/bin/$ARC/qrsh -inherit -nostdin $rhost $cmd
exec $SGE_ROOT/bin/$ARC/qrsh -inherit -nostdin $rhost $cmd
else
echo $SGE_ROOT/bin/$ARC/qrsh -inherit $rhost $cmd
exec $SGE_ROOT/bin/$ARC/qrsh -inherit $rhost $cmd</pre>
  <p>should read:</p>
  <pre>echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
else
echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd</pre>
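  <p>A minimal job script following this approach could then look like
the sketch below (PE name, slot count and program are placeholders; the
machines file is created by the startmpi.sh of the PE):</p>
  <pre>#!/bin/sh
#$ -pe mpich 4
#$ -cwd
export MPICH_PROCESS_GROUP=no
mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./a.out</pre>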
  <p><b>3. Recompile MPICH</b></p>
  <p>When you have the source of the used programs, it is also possible
to disable the creation of a new process group in MPICH and remove this
behaviour completely.<br>
  </p>
  <blockquote><b>For MPICH 1.2.5.2 and 1.2.6:</b>
    <p>The file where it must be done is session.c in
mpich-1.2.5.2/mpid/util. Change the line:</p>
    <pre>#if defined(HAVE_SETSID) &amp;&amp; defined(HAVE_ISATTY) &amp;&amp; defined(SET_NEW_PGRP)</pre>
    <p>to</p>
    <pre>#if defined(HAVE_SETSID) &amp;&amp; defined(HAVE_ISATTY) &amp;&amp; defined(SET_NEW_PGRP) &amp;&amp; 0</pre>
    <p>This way you can easily go back at a later point in time.<br>
    </p>
  </blockquote>
  <p>Because the routine will be linked into your final program, you
have to recompile all your programs. It is not sufficient to just
install this new version in /usr/lib (or your path to MPICH), unless
you have used the shared libraries of MPICH. Whether a delivered binary
from a vendor uses the shared version of MPICH, or has it statically
linked in, you can check with the LINUX command ldd.</p>
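  <p>For example (path and output are illustrative; a statically linked
binary will show no libmpich line at all):</p>
  <pre>&gt; ldd /path/to/vendor_binary | grep -i mpich
        libmpich.so.1 =&gt; /usr/lib/libmpich.so.1 (0x40020000)</pre>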
  <p>Before you run ./configure for MPICH, you should export the
variable RSHCOMMAND=rsh to get the desired rsh command compiled into
the libraries. To create shared libraries and use them during the
compilation of an MPICH program, please refer to the MPICH
documentation.</p>
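  <p>A sketch of the build sequence (the installation prefix is just an
example):</p>
  <pre>&gt; export RSHCOMMAND=rsh
&gt; ./configure --prefix=/opt/mpich-1.2.6
&gt; make
&gt; make install</pre>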
</blockquote>
<p>
</p>
<hr><br>
<font size="+1"><b>Number of tasks spread to the nodes</b></font>
<p></p>
<blockquote>The number of calls to qrsh differs depending on the used
MPICH device. If you use MPICH with the ch_p4 device over ethernet, the
first task will always be started locally without any qrsh call, and
qrsh will only be called (n-1) times. Hence you can set
"job_is_first_task" to TRUE in the definition of your PE and allow only
(n-1) calls to qrsh by your job.
  <p>If you are using Myrinet, it's different: in this case qrsh will
be used exactly n times. So set "job_is_first_task" to FALSE.</p>
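  <p>In short, for the PE definition (as displayed by <span
 style="font-style: italic;">qconf -sp &lt;pe_name&gt;</span>):</p>
  <pre># ch_p4 over ethernet: the first task runs locally without qrsh
job_is_first_task  TRUE

# Myrinet (ch_gm): all n tasks are started via qrsh
job_is_first_task  FALSE</pre>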
</blockquote>
<p>
</p>
<hr><br>
<font size="+1"><b>What rsh command is compiled into a
binary?</b></font>
<p></p>
<blockquote>If you only got a binary from a vendor, and you are in
doubt what rsh command was compiled into the program, you can try to
get some more information with:
  <pre>&gt; strings &lt;programname&gt; | awk ' /(rsh|ssh)/ { print } '</pre>
  <p>This may give you more output than you like, but you should at
least get some clue about it.</p>
</blockquote>
<p>
</p>
<hr><br>
<font size="+1"><b>Option -nolocal to mpirun</b></font>
<p></p>
<blockquote>You should avoid setting this option, either in any
mpirun.args script or as an option to the command. The first problem is
that the head node of the MPI job (this is not the master node of the
job, but one of the selected slaves) will not be used at all, so the
processes will only be spread to the other nodes in the list (if you
got just 2 slots on one machine, the job can't run at all this way).
The second problem is the used rsh command: because the first rsh
command just starts the head node (which would be on a slave this way),
the setting of P4_RSHCOMMAND is ignored.</blockquote>
<p>
</p>
<hr><br>
<font size="+1"><b>Uneven distribution of processes to the slave
nodes with two network cards</b></font>
<p></p>
<blockquote>As Andreas pointed out in <a
 href="http://gridengine.sunsource.net/servlets/ReadMsg?msgId=15741&amp;listName=users">http://gridengine.sunsource.net/servlets/ReadMsg?msgId=15741&amp;listName=users</a>,
you have to check whether the first scan of the machines file by MPICH
can remove an entry at all, because `hostname` may give a different
name than the one included in the machines file (because you are using
a host_aliases file). Depending on the setup of your cluster, it may be
necessary to change just one entry back to the one delivered by
`hostname`. If you change all entries back to the external interface
again, your program may use the wrong network for the communication by
MPICH. This may, or may not, be the desired result. Changing just one
entry back to the `hostname` is a small addition to the
PeHostfile2MachineFile subroutine in startmpi.sh:
  <pre>PeHostfile2MachineFile()
{
    myhostname=`hostname`
    myalias=`grep $myhostname $SGE_ROOT/default/common/host_aliases | cut -f 1 -d " "`

    cat $1 | while read line; do
        # echo $line
        host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
        nslots=`echo $line|cut -f2 -d" "`
        i=1
        while [ $i -le $nslots ]; do
            # add here code to map regular hostnames into ATM hostnames

            if [ $host = "$myalias" ]; then  # be sure to include " for the second argument
                echo $myhostname
                unset myalias
            else
                echo $host
            fi

            i=`expr $i + 1`
        done
    done
}</pre>
  <p>Don't include this if you already changed the startmpi.sh for the
handling of Turbomole, as you would get the external name twice. In
general, I prefer having one PE for each application. Thus copy the mpi
folder in $SGE_ROOT, name the copy e.g. turbo, adf, linda etc., and
create corresponding PEs.</p>
</blockquote>
<p>
</p>
<hr><br>
<font size="+1"><b>Wrong interface selected for the back channel of
the MPICH-tasks with the ch_p4-device</b></font>
<p></p>
<blockquote>With the hints given above, you may achieve the goal of an
even distribution of your MPICH tasks, but the network interface used
for the communication of the slave processes with the master node may
still be the wrong one. You will notice this when you have a look at
the program calls for the slave processes. Assume 'ic...' is the
internal network (which should be used) and 'node...' the external one
for the NFS traffic. Then the call:
  <pre>rsh ic002 -l user -x -n /home/user/a.out node001 31052 \-p4amslave \-p4yourname ic002 \-p4rmrank 1 </pre>
  <p>will use the wrong interface for the information sent back to the
master. It should have the form:</p>
  <pre>rsh ic002 -l user -x -n /home/user/a.out ic001 33813 \-p4amslave \-p4yourname ic002 \-p4rmrank 1</pre>
  <p>This can be changed by setting and exporting the $MPI_HOST
environment variable by hand before the call to mpirun, to the name of
the internal interface of the execution node instead of the (default)
`hostname` of the external interface. You can also do it for all users
in the mpirun.args, where $MPI_HOST will be set if it was unset before
the call to mpirun; change:</p>
  <pre>MPI_HOST=`hostname`</pre>
  <p>to:</p>
  <pre>MPI_HOST=`grep $(hostname) $SGE_ROOT/default/common/host_aliases | cut -f 1 -d " "`</pre>
  <p>In case that you don't use a host_aliases file at all, but still
want to change to another interface, you can e.g. use a sed command
like:</p>
  <pre>MPI_HOST=`hostname | sed "s/^node/ic/"`</pre>
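  <p>For the per-job approach mentioned above, the corresponding lines
in the job script could look like this sketch (using the same
illustrative name mapping as in the sed example):</p>
  <pre>MPI_HOST=`hostname | sed "s/^node/ic/"`
export MPI_HOST
mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./a.out</pre>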
  <p>If you decide to use this way, then take care with the given "Hint
for Turbomole", because you also have to take the internal hostname
into account there and use it. On the other hand, you will not have to
apply the change of "Uneven distribution of processes..." if you
already changed the $MPI_HOST for mpirun. All this highly depends on
the setup of your cluster and the desired use of your network
interfaces. So these <i>might</i> only be places where you <i>can</i>
change your setup; there is no golden rule which will fit all
configurations. Just implement them and have a look at some test jobs,
how they are distributed to the nodes and which interfaces they are
using.</p>
</blockquote>
<p>
</p>
<hr><br>
<font size="+1"><b>Hint for ADF</b></font>
<p></p>
<blockquote>Some programs (like ADF parallel with MPICH) have a
hard-coded /usr/bin/rsh inside as their rsh command. To access any
wrapper and let SGE take control over the slaves, you <b>must</b> set:
  <pre>export P4_RSHCOMMAND=rsh</pre>
  <p>in the job script. Otherwise, the built-in /usr/bin/rsh will
always be used. For the usage of ssh, please refer to the appropriate
HOWTO for ssh, and leave the setting for MPICH at rsh to access the
rsh-wrapper of SGE, which will then use ssh in the end.</p>
</blockquote>
<p>
</p>
<hr><br>
<font size="+1"><b>Hint for Turbomole (no longer applies when HP-MPI /
Platform-MPI is used with recent versions of Turbomole)<br>
</b></font>
<p></p>
<blockquote>Turbomole is using the ch_p4 device of MPICH. But due to
the fact that this program will always start up one slave process more
(a server task without CPU load) than parallel slots requested, you
have to set "job_is_first_task" to FALSE, to allow this additional qrsh
call to be made. As mentioned above, MPICH will make only (n-1) calls
to qrsh, but Turbomole will request (n+1) tasks, so that you end up
with n again.
  <p>Because one more call to qrsh is made by TURBOMOLE than SGE is
aware of (because of the behaviour of MPICH to start one task locally
without qrsh, while still removing one line during the first scan of
the machines file), the created machines file on the master node will
have one line less than necessary. This leads to a second scan of the
machines file, starting at the beginning. If the first line in the
machines file is also the master node where the server task for
Turbomole is running, all is in best order, because the server process
running there doesn't use any CPU time, and another working slave task
is welcome. But in all other cases, you will have the slave processes
unevenly distributed to the slave nodes. Therefore just add one line to
the already existing startmpi.sh in the $SGE_ROOT/mpi directory, to
create one more entry with the name of the master node:</p>
  <pre>if [ $unique = 1 ]; then
    PeHostfile2MachineFile $pe_hostfile | uniq &gt;&gt; $machines
else
    PeHostfile2MachineFile $pe_hostfile &gt;&gt; $machines

    #
    # Append own hostname to the machines file.
    #

    echo `hostname` &gt;&gt; $machines
fi</pre>
  <p>This has the positive side effect of providing one entry in the
machines file which can actually be removed during the first scan by
MPICH: it will remove the line with `hostname`. If there is none,
nothing is removed. This may be the case when you have two network
cards in the slaves and `hostname` gives the name of the external
interface, but the hostlist created by SGE contains only the internal
names.</p>
</blockquote>
</body>
</html>
